Human decision behaviour is quite diverse. In many games humans on average do not
achieve maximal payoff, and the behaviour of individual players remains inhomogeneous
even after playing many rounds. For instance, in repeated prisoner's dilemma games
humans do not always optimize their mean reward and frequently exhibit broad
distributions of cooperativity. The reasons for these failures of maximization are not
known. Here we show that the dynamics resulting from the tendency to shift choice
probabilities towards previously rewarding choices, in closed-loop interaction with the
strategy of the opponent, can not only explain systematic deviations from 'rationality'
but also reproduce the diversity of choice behaviours. As a representative example we
investigate the dynamics of choice probabilities in prisoner's dilemma games with
opponents using strategies with different degrees of extortion and generosity. We find
that even a simple model of human learning can account for a surprisingly wide range
of human decision behaviours. It reproduces the suppression of cooperation against
extortionists and the increasing cooperation when playing with generous opponents,
explains the broad distributions of individual choices in ensembles of players, and
predicts the evolution of individual subjects' cooperation rates over the course of the
games. We conclude that important aspects of human decision behaviour are rooted in
elementary learning mechanisms realised in the brain.

Humans and animals regularly fail to achieve maximal payoff even in simple choice
situations, and even when they are allowed to try them out over many moves. Such
observations have triggered a plethora of research into the lack of 'rationality' in human
and animal decision behaviour. Deviations from utility maximization were explained by
psychological peculiarities including, e.g., probability matching [1] and mis-estimations
of utilities [2]. Also, auxiliary motives or emotions, which cannot always be excluded
experimentally, can have a substantial influence on decision behaviour [3]. Last but not
least, exploration and learning mechanisms are increasingly suspected to contribute
substantially to deviations from strict rationality in repeated games (for an encompassing
review see [4]). Here, wide, skewed, and U-shaped distributions of choice propensities
were occasionally observed [1, 4, 5, 6]. While these findings hint at the existence of
multiple attractors for the choice propensities, in analogy to similar considerations in
evolutionary game theory [7, 8], this link has not yet been established for human
learning in economic games.

Social dilemma games have served as a paradigm in theoretical research on the evolution
of cooperativity. Being particularly well understood, they provide a testbed for
investigating human deviations from optimal decision behaviour. In the prisoner's
dilemma two players, named X and Y, play a game in which they can either cooperate
or defect. In the case of mutual cooperation both get a reward $r_{CC}$. If X defects and
Y cooperates, X gets an even larger reward $r_{DC}$ while Y gets a lower reward $r_{CD}$.
If both players defect, both get a reward $r_{DD}$. These rewards must satisfy two
inequalities: $r_{DC} > r_{CC} > r_{DD} > r_{CD}$ leads to a Nash equilibrium of mutual
defection, while $2 r_{CC} > r_{DC} + r_{CD}$ guarantees that mutual cooperation has the
best global outcome. The latter condition establishes the dilemma between individual
gain from defection and the larger common benefit of cooperation. In the following we
use the standard values defined in table 1, which Axelrod used for his famous computer
tournament [9].


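Table 1: Definition of the reward r. Each cell lists the reward of X followed by the
reward of Y.

                        Computer: X
                        cooperate x = C                      defect x = D
Y   cooperate y = C     0.3 ($r_{CC}$), 0.3 ($r_{CC}$)       0.5 ($r_{DC}$), 0.0 ($r_{CD}$)
    defect y = D        0.0 ($r_{CD}$), 0.5 ($r_{DC}$)       0.1 ($r_{DD}$), 0.1 ($r_{DD}$)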
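The two conditions on the rewards can be checked directly for the values of Table 1. The following is a minimal Python sketch; the dictionary layout and the function name `is_prisoners_dilemma` are illustrative assumptions and not part of the original study.

```python
# Rewards from Table 1; the key reads (own action, other player's action).
REWARDS = {"CC": 0.3, "DC": 0.5, "CD": 0.0, "DD": 0.1}


def is_prisoners_dilemma(r):
    """Check both reward conditions stated in the text."""
    # Individual incentive: defecting pays more whatever the other player does.
    defection_is_nash = r["DC"] > r["CC"] > r["DD"] > r["CD"]
    # Joint incentive: mutual cooperation beats alternating exploitation.
    cooperation_is_best_jointly = 2 * r["CC"] > r["DC"] + r["CD"]
    return defection_is_nash and cooperation_is_best_jointly


print(is_prisoners_dilemma(REWARDS))
```

Raising the temptation reward to 0.7 while keeping the other values would violate the second condition, so such a game would no longer be a prisoner's dilemma in the sense used here.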


Recently a new class of strategies for the prisoner's dilemma was introduced by Press
and Dyson [10], the so-called zero-determinant (ZD) strategies. ZD strategies are
memory-one strategies, i.e. they base their decision only on the information from the
last round. It can be shown that against ZD strategies, longer-memory strategies cannot
have any advantage over memory-one strategies. ZD strategies have two further
characteristics: (i) they enforce a linear relationship between the rewards of the players;
(ii) for a positive slope of this relation the best strategy for the co-player, meaning the
strategy with the highest reward, is total cooperation. In this paper we concentrate on
two extreme types of ZD strategies: an extortion strategy, which dominates any adaptive
opponent [10], and a generous strategy, which ensures that her own reward does not
exceed the reward of the co-player [11]. Figure 1 shows the expected reward of a purely
stochastic player (Y) with a single fixed cooperation probability when playing against
the extortion (1b) or the generous strategy (1d) of X. For this simple strategy of Y, which
takes no memory into account, but also for all other possible strategies of Y, total
cooperation always leads to the maximum reward [10].

Figure 1: Panels a) and c) show conditional expected rewards for strong extortion and
strong generosity, respectively, for the context-free player using a single
cooperation rate p. The black line is the expected reward $r(p \mid y = C)$ given
that the player cooperated, and the green line the expected reward
$r(p \mid y = D)$ given that the player defected. The yellow and orange lines are
these conditional expected rewards weighted with their corresponding
probabilities $p$ and $1-p$, yielding the expected rewards from cooperative and
defective acts, respectively. The learning rule has an unstable fixed point $p^*$
where these functions cross. Panels b) and d) show the total expected rewards of
the player (red) and of the respective opponent ZD strategy (blue).
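Property (i) and the optimality of total cooperation can be illustrated numerically for the context-free player Y with a single cooperation probability p. The sketch below assumes the strong extortion probabilities listed in Table 2 and the rewards of Table 1, and it obtains the long-run state probabilities simply by iterating the Markov chain of the game. The function name `mean_payoffs` and the iteration count are hypothetical choices. Because the tabulated probabilities are rounded, the linear relation holds only approximately.

```python
ACTIONS = ("C", "D")
# Rewards from Table 1, keyed by (own action, other player's action).
REWARD = {("C", "C"): 0.3, ("D", "C"): 0.5, ("C", "D"): 0.0, ("D", "D"): 0.1}
# Strong extortion strategy from Table 2, keyed by (X's last move, Y's last move).
EXTORTION = {("C", "C"): 0.692, ("C", "D"): 0.0, ("D", "C"): 0.538, ("D", "D"): 0.0}


def mean_payoffs(zd, p, iterations=500):
    """Long-run mean payoffs (X, Y) when Y cooperates with fixed probability p."""
    states = [(x, y) for x in ACTIONS for y in ACTIONS]
    s = {state: 0.25 for state in states}  # start from a uniform state distribution
    for _ in range(iterations):
        nxt = {}
        for x, y in states:
            prob_y = p if y == "C" else 1.0 - p
            total = 0.0
            for xp, yp in states:
                prob_x = zd[(xp, yp)] if x == "C" else 1.0 - zd[(xp, yp)]
                total += prob_x * prob_y * s[(xp, yp)]
            nxt[(x, y)] = total
        s = nxt
    payoff_x = sum(s[(x, y)] * REWARD[(x, y)] for x, y in states)
    payoff_y = sum(s[(x, y)] * REWARD[(y, x)] for x, y in states)
    return payoff_x, payoff_y


# Payoff pairs for three cooperation probabilities of Y.
points = [mean_payoffs(EXTORTION, p) for p in (0.1, 0.5, 0.9)]
```

Under these assumptions the three payoff pairs in `points` lie close to a common straight line with positive slope, Y's payoff grows with p, and X's payoff stays above Y's.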

An elegant experiment by Hilbe et al. [12] demonstrated a particularly striking lack of
cooperation in humans when repeatedly playing prisoner's dilemma games against
zero-determinant strategies that extort their opponents [10]. In contrast, when playing
against generous ZD strategies, human cooperation rates increased with time. As an
explanation for this substantially sub-optimal behaviour against extortionists, a desire
for punishment was offered.

This experiment also revealed a large diversity of individual choice propensities
('cooperativities'), sometimes leading to bimodal ('U-shaped') distributions after a series
of moves, where some participants ended up always defecting while others always
cooperated [12, supplement, Fig. 1].

Obviously, the dynamics of choice propensities during the course of a game can lead to 
opposite behaviours in different individuals. We wondered if the coevolution of learning 
in the human player in interaction with the opponent’s strategy could yield a dynamics 
that would explain this wide spread of individual behaviours. 

Adaptations of human and animal behaviour in repeated games have been studied 
extensively in economics [13, 14, 15]. Since the pioneering work of Pavlov more than a
hundred years ago, research into reward-dependent learning of choice preferences brought
about a range of hypothetical learning rules based on success-dependent reinforcement of 
behaviour. It turned out that adaptations of animal as well as human behaviours are far 
from trivial even in very simple choice situations (for games see the review [4]). Already 
the phenomenology of Classical Conditioning can be explained only partially by the 
famous Rescorla-Wagner learning rule [16]. Therefore, more sophisticated algorithms 
for reinforcement learning were introduced, among which TD-learning [17] is now a 
leading paradigm in behavioural brain research and the Bush-Mosteller-Rule (BMR) 
[18] is frequently used in Psychology. 

All psychological learning rules share the basic principle of reinforcing choice
probabilities for successful choices and/or weakening choice probabilities for negatively
rewarded decisions (Thorndike's famous 'law of effect' [19]). Systematic biases, either
inherent in the particular learning rule or in specific parameter settings, were frequently
used to explain systematic deviations from optimal behaviour. As a representative
example we cite here only a recent careful study in which success-dependent learning
effects in professional basketball players were explained by different learning rates for
updating the respective choice probabilities [20].

Several studies investigated adaptations of choice probabilities in their interaction with
the dynamics of the environment. An important example here is the work of Herrnstein
[1], which shows that neither probability matching nor payoff maximization can explain
human decision behaviour when the response of the environment is history dependent.
To our knowledge the first mathematical analysis of the dynamics of choice probabilities
was performed by Izquierdo and co-workers [21], who used the Bush-Mosteller rule to
investigate the putative flow of cooperation probabilities of two players in social dilemma
games. While their formal method is interesting from a theoretical perspective, the
concrete predictions of this study are difficult to test, since this would require knowledge
of the strategies used by both human players, which is not available.

The case considered by Hilbe et al. [12] is more useful in this respect, since the
opponent's fixed zero-determinant strategies are analytically fully transparent [10]. We
use the data from these experiments as a benchmark for investigating the evolution of
individual choice probabilities in interaction with a dynamic environment.


Results 

We model the choice behaviour $y(t) \in \{C, D\}$ of a subject Y in prisoner's dilemma
games using a stochastic strategy based on cooperation probabilities $p_Q(t)$. Given the
context $Q$ and the decision at (discrete) time $t = 1, \ldots, T$, the player receives a
reward $r(t)$ which depends on the behaviour $x(t) \in \{C, D\}$ of the opponent X. In
general $Q$ can consist of any information available during the game.

Here we restrict our analysis to three cases: $Q$ representing a) no context, b) the choice
$x(t-1)$ of the opponent X in the last round, and c) the combination $(x(t-1), y(t-1))$
of the choices of X and Y in the last round. Since taking more history into account cannot
help Y to increase her income when playing against zero-determinant (ZD) strategies
[10], it is not considered.

The choice probabilities $p_Q(t)$ are adapted according to the success of behaviour by a
simple learning rule:

$$p_Q(t+1) = p_Q(t) + \begin{cases} +\varepsilon_C\, r(t)\, \delta_{QQ'} & \text{for } y = C \\ -\varepsilon_D\, r(t)\, \delta_{QQ'} & \text{for } y = D \end{cases} \qquad (1)$$

where $\varepsilon_{C,D}$ are the learning rates in the case of cooperation and defection,
respectively. $\varepsilon_C > \varepsilon_D$ leads to a bias $b$ towards cooperation.
$\delta_{QQ'}$ is the Kronecker delta, which equals one only if $Q$ matches the context
$Q'$ that was present at time $t$, and is zero otherwise.
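A single application of rule (1) for case b) can be sketched as follows. The sketch is illustrative only: the function name `update` and the dictionary representation of the two cooperation probabilities are assumptions, and the clipping to the interval [0, 1] follows the boundary treatment described in the Methods.

```python
def update(p, context, action, reward, eps_c, eps_d):
    """Return the cooperation probabilities after one round of rule (1).

    p       -- dict mapping the context ('C' or 'D', the opponent's last move)
               to the probability of cooperating in that context
    context -- the context that was actually present in this round
    action  -- the player's own choice in this round, 'C' or 'D'
    """
    new_p = dict(p)
    # Cooperation is reinforced, defection weakens the cooperation probability.
    delta = eps_c * reward if action == "C" else -eps_d * reward
    # Only the probability of the active context changes (Kronecker delta).
    new_p[context] = min(1.0, max(0.0, new_p[context] + delta))
    return new_p


p = {"C": 0.5, "D": 0.5}
p = update(p, context="C", action="C", reward=0.3, eps_c=0.09375, eps_d=0.03125)
```

After this rewarded cooperative act only the entry for context C has moved upwards; the entry for context D is untouched.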


With no context (case a) the index $Q$ can be dropped. The expected change
$\langle \Delta p \rangle = \langle p(t+1) - p(t) \rangle$ of the cooperation probability $p$
can be calculated (see Methods) and depends on the difference between the expected
reward for cooperation $\bar r_C$ and that for defection $\bar r_D$. Fig. 1 a,c shows the
results for the games against extortionist and generous opponents, respectively. The
fixed points $p^*$ with $\bar r_C(p^*) = \bar r_D(p^*)$ are unstable in both cases: players
with initial $p$ above this point will tend to increase their cooperation probabilities and
vice versa. In ensembles of players with widely spread initial conditions the distribution
will split, the respective cooperation probabilities will move towards the boundaries,
and after many rounds one expects 'U-shaped' distributions of cooperation propensities.
Ensembles with initial conditions above the fixed point will converge towards
cooperation, which has indeed been observed [22]. However, since the fixed points for
extortion and generosity are quite similar, this case is not able to explain the results of
Hilbe et al. [12].

Figure 2: Fraction of cooperating subjects during the course of the game from the
experiments by Hilbe et al. [12] (red) and in simulations (blue) with N = 10201
players (panels: generous strategies, extortion strategies). The pink shaded
region represents the standard errors from the experiments.

Case a) serves to demonstrate the basic effect: reward-dependent adaptation of $p$ can
induce multistability and does not necessarily lead to maximization. For this case of no
history dependency of the player's cooperation propensity, the difference between
extortionist and generous opponents is small. Since the unstable fixed points are at
nearly the same position ($r(p^*) = 0.69$ in the extortion and $r(p^*) = 0.65$ in the
generous case), human subjects would act similarly when playing against the different
treatments.

Case b) with $Q \in \{C, D\}$ is more useful for explaining the results of Hilbe et al.,
even if no cooperation bias is present ($\varepsilon_C = \varepsilon_D$). Case c) leads to
similar dynamics as case b) and is therefore not shown. Figure 2 compares the fraction
of cooperating subjects during the course of the game when playing against extortion
and generous treatments with the respective simulations of the model. In the experiment
[12] the humans start with a cooperation rate of 30-40%. In the generous treatment the
subjects behave 'rationally', i.e. they optimise their strategy towards cooperation. At
the end they reach a cooperation probability of 70-80%, which leads to a mean payoff
of 0.27 € that is not far from the best possible payoff of 0.30 €. Against extortion
strategies humans behaved differently. Their cooperation probability dropped to low
values (30-40%) and they received a mean reward of only 0.15 €, which is quite far from
the maximally possible payoff of 0.22 €. The simulations lead to very similar behaviour:
in the generous case the cooperation probability starts at 47% and goes up to 76%. In
the extortion case the agents also start with a cooperation probability of 47%, which
decreases during the course of the game and ends at 24%. In both cases the respective
propensities of cooperation are fully consistent with the data, with an increasing
cooperation probability in the generous and a slightly decreasing cooperation
probability in the extortion case.

[Figure 3 panels: Experiment, Simulation; row: strong generosity; x-axis: cooperative
acts during the last ten rounds]

Figure 3: Cooperative acts during the last ten rounds. Left part of the figure taken from
[12].

Figure 3 compares the behaviour of the model after 60 rounds with the results of
Hilbe et al. [12]. In the strong extortion case the results of the model match the
experimental data; the peak at zero has nearly the same height. The generous case
matches qualitatively: the large peak at ten appears in both the simulation and the
experiment, but in the simulation this peak is smaller than in the experiment.

Figure 4: Flow in the strong extortion and generous case with the unstable fixed points
(black dots), the stable fixed points and manifolds (red dots and lines), and
the corresponding zones of attraction (black lines). The green curves are
exemplary trajectories of single agents with the starting point (0.5, 0.5) over
200 rounds.

To understand the dynamics of the model we calculate the expected changes of both
choice probabilities $p_C$ and $p_D$ (see Methods) and display them as a vector field
depending on $p_C$ and $p_D$ (i.e. the flow in choice probability space, as in Izquierdo
et al. [21]). Figure 4 shows the unstable and stable fixed points and the corresponding
basins of attraction of this flow. There are four unstable fixed points and three stable
manifolds with basins of attraction that are delimited by the nullclines.

For the strongly generous opponent we find a single unstable fixed point within the
preference space at $(p_D, p_C) = (0.38, 0.4)$. Together with the fixed points at the left
boundary at (0, 0.45) and at the lower boundary at (0.41, 0), it determines the boundary
of the basin of attraction of the stable fixed point in the lower left corner at (0, 0). This
stable fixed point is the state of permanent defection. The basin of attraction of the
stable fixed point in the lower right corner at (1, 0) is formed by the central and the left
fixed point and the fixed point at the right boundary at (1, 0.38). At this fixed point the
agents always take the action opposite to the one the opponent chose in the last round,
so we call this behaviour 'anti-Tit-for-Tat'.

The stable manifold most relevant for the experiment with generous opponents is the
upper one at $p_C = 1$. The corresponding basin of attraction is bounded by the left,
central and right unstable fixed points. On the one hand, the basin of attraction of this
manifold has the largest overlap with the initial conditions (see Methods), so most of
the agents will move up towards this stable manifold. On the other hand, agents on this
manifold always cooperate if the opponent has cooperated in the last round, and because
generous strategies also behave like this (see table 2), this is a state of permanent mutual
cooperation. This explains why nearly every human in this experiment and the largest
fraction of agents in the simulation cooperated permanently at the end of the game (see
figure 3).

Table 2: Cooperation probabilities of the ZD strategies. $P_{CD}$ is the cooperation
probability of the ZD strategy if the ZD strategy has cooperated and the opponent
has defected in the previous round.

Strategies          $P_{CC}$   $P_{CD}$   $P_{DC}$   $P_{DD}$
strong extortion    0.692      0.000      0.538      0.000
strong generous     1.000      0.182      1.000      0.364

For the strong extortion opponents the fixed points are similar to those of the strong
generous case. However, the central unstable fixed point (here at (0.47, 0.44)) together
with the upper (0.44, 1), the lower (0.49, 0) and the right (1, 0.4) unstable fixed points
now divide the space into three distinct basins of attraction. All agents that are
initialized in the upper right basin will tend to go to the stable fixed point at (1, 1) of
permanent cooperation. In contrast, the agents that start in the lower right basin will
move towards the anti-Tit-for-Tat fixed point at (1, 0).

The agents that start in the left basin will tend to go to the large stable manifold at
$p_D = 0$, which is most important for understanding the experimental results for this
case. On the one hand, the basin encompasses the initialization area; on the other hand,
the agents always defect if the opponent has defected in the previous round. Since
extortion strategies do the same (see table 2), this leads to a state of permanent mutual
defection. This explains why the largest fraction of subjects in the experiment and
agents in the simulation prefer defection at the end of the game (see figure 3).

A critical prediction of our model is that the evolution of the players' choices should
depend on their respective individual initial conditions. We tested this property of our
model by comparing the final cooperation rates of individual subjects with the
corresponding predictions of our model, which were based on the subjects' observed
choices during the first ten rounds (see Methods). Figure 5 summarises the results for
all games and every subject. The linear regression (solid line) yields a function with a
slope of 0.69 and a y-intercept of 0.26.

In principle, the flow field can be estimated directly from experimental data, which will
make it possible to test different hypothetical learning rules experimentally. To
demonstrate this, we used a naive method (see Methods) with computer-generated data
and found that a good match with the analytic flow can be achieved already with 500
agents in a game of 200 rounds (figure 6).

Figure 5: Fraction of cooperative acts during the last ten rounds of each subject in all 4
ZD treatments (strong generosity, strong extortion, mild generosity, mild
extortion) from the experiments of Hilbe et al. [12] versus the predictions of
the model. The solid line is the linear regression, which has a slope of 0.69 and
a y-intercept of 0.26.

Discussion 

The dynamics of choice probabilities was analysed for a simple model of learning in
subjects playing iterated prisoner's dilemma games against zero-determinant strategies.
The flow in choice preference space exhibits characteristic basins of attraction:
depending on their initial readiness to cooperate, after many rounds of the game players
tend either to always cooperate or to always defect. Thereby small individual differences
in cooperation rates may lead to opposite behaviours in the long run. While the insight
that learning can lead to extreme behaviours is not new [21, 23], the general method
introduced here makes it possible to determine the potential influence of the dynamics
on systematic deviations from optimal behaviour for all environments that depend on
few past moves. For instance, we suspect that many of the systematic deviations from
both utility maximization and matching observed experimentally, as in [1], can be
explained by our framework.

When playing against opponents with zero-determinant strategies, a longer memory
beyond the last move cannot improve the payoff [10], an intriguing fact which makes
these games a benchmark for investigating the joint dynamics of learning and
environment. With a plausible initial distribution of cooperation probabilities the model
is able to quantitatively explain the full phenomenology of the population data gathered
by Hilbe et al. [24]. In particular, it reproduces the average dynamics of cooperation
rates and explains the wide distributions of cooperativity among individual players.

Our results are robust with respect to many details of the learning rule. As a test we
applied the full Bush-Mosteller rule [18] and obtained similar results when the aspiration
parameter in the BMR was small. The aspiration is an offset to which the received payoff
is compared; only the difference is used for reinforcement of the choice probabilities.
Since the subjects in the experiment of Hilbe et al. were paid according to the absolute
payoff of the prisoner's dilemma game and had no explicit costs, setting this parameter
to zero is a plausible choice. If, however, the subjects needed to compare the received
rewards to some explicit costs, our model would predict the emergence of stable fixed
points in preference space located at intermediate values of cooperation rates. This
prediction of our approach could easily be tested in future experiments.

A central prediction of our model that can already be tested on the data available from
[24] is that the asymptotic cooperation rates of individual players should depend on
their initial tendency to cooperate. This critical prediction was confirmed to a striking
degree: based on the sequence of choices of individual subjects during the first ten
rounds of the games, the model's predictions closely matched the average fraction of
cooperative acts in the last ten rounds (figure 5). Prediction improved with a non-zero
bias for successful cooperation in the learning rule, which is equivalent to assuming an
additional value of mutual cooperation [25]. These results demonstrate that the proposed
learning mechanism captures the dynamics of choices not only qualitatively but also at
the level of individual subjects. We did not attempt to further optimize the two free
parameters of the model (learning rate and cooperation bias), nor did we perform an
extensive search for better initializations of the cooperation rates, which obviously could
only improve these results. Also, we did not consider different individual learning rates,
which could further improve the match to experimental data, however at the cost of
additional model parameters.

We believe, however, that the prediction performance of our model would improve more
substantially if the initial cooperation propensity and the learning rate could be
estimated directly for each subject before the game starts. Then, given the model is
correct, only the noise intrinsic to players and opponents would limit its predictability.
In principle, the flow itself can also be estimated from experimental data. Using a naive
method, flow estimation in simulations (see Methods) became significant with 500
players making 200 moves. This second approach demonstrates that experiments that
would directly test hypothetical learning rules are within reach. We plan to perform
these tests in the near future. Because the method of determining the flow from groups
of players can be implemented for any learning rule, it can in principle also be used for
determining characteristics of learning implemented in subjects from selected groups.
We speculate that it would thereby become possible to investigate the relation of
pathological behaviour (as, e.g., in addiction) to systematic aberrations of learning.

Taken together, our results suggest a fundamental role of the joint dynamics of learning
and environment for explaining experimentally observed deviations from optimal
decision behaviour, a role that has previously been overlooked.

Methods


Model

Since the learning rule (1) can lead to probabilities smaller than zero or larger than
one, we always set the cooperation probability to zero or one, respectively, if these
boundaries are passed. The learning rates are $\varepsilon_C = 0.09375$ and
$\varepsilon_D = 0.03125$, and all simulations were performed with
$N = 101^2 = 10201$ agents. For the initial conditions we assume that human beings tend
to cooperate if their opponent has cooperated and to defect if the opponent has defected.
That is, we choose uniform distributions within the borders $0 < p_D < 0.45$ and
$0.45 < p_C < 1$. In the first round we choose one of the two cooperation probabilities
randomly.
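The model of a single agent can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the original simulation code: the opening move of the ZD strategy is not specified in the text and is drawn here with probability 0.5, the default of 60 rounds is taken from the comparison in Figure 3, and all names (`simulate_agent`, `REWARD`, `GENEROUS`, `EXTORTION`) are hypothetical.

```python
import random

# Rewards from Table 1, keyed by (own action, other player's action).
REWARD = {("C", "C"): 0.3, ("D", "C"): 0.5, ("C", "D"): 0.0, ("D", "D"): 0.1}
# ZD strategies from Table 2, keyed by (ZD's last move, human's last move).
EXTORTION = {("C", "C"): 0.692, ("C", "D"): 0.0, ("D", "C"): 0.538, ("D", "D"): 0.0}
GENEROUS = {("C", "C"): 1.0, ("C", "D"): 0.182, ("D", "C"): 1.0, ("D", "D"): 0.364}


def simulate_agent(zd, rounds=60, eps_c=0.09375, eps_d=0.03125, rng=None):
    """Play one learning agent Y against the ZD strategy `zd`."""
    if rng is None:
        rng = random.Random()
    # Initial conditions: uniform in 0 < p_D < 0.45 and 0.45 < p_C < 1.
    p = {"D": rng.uniform(0.0, 0.45), "C": rng.uniform(0.45, 1.0)}
    # First round: one of the two cooperation probabilities is picked at random.
    context = rng.choice(("C", "D"))
    # ASSUMPTION: opening move of the ZD strategy (not given in the text).
    x = "C" if rng.random() < 0.5 else "D"
    history = []
    for _ in range(rounds):
        y = "C" if rng.random() < p[context] else "D"
        reward = REWARD[(y, x)]
        delta = eps_c * reward if y == "C" else -eps_d * reward
        # Rule (1) with clipping at the boundaries 0 and 1.
        p[context] = min(1.0, max(0.0, p[context] + delta))
        history.append(y)
        next_x = "C" if rng.random() < zd[(x, y)] else "D"
        context, x = x, next_x  # the current ZD move becomes the next context
    return history, p


history, p_final = simulate_agent(GENEROUS, rng=random.Random(1))
cooperative_acts_last_ten = history[-10:].count("C")
```

Repeating the call for many agents and both treatments would give ensembles comparable in spirit to those of Figures 2 and 3, although agreement with the published numbers is not guaranteed under the assumptions made here.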

Mean Field Theory 

To understand the results of the model, we calculate the expectation value
$F_k = \langle \Delta p_k \rangle$ of equation (1). $\alpha_y$ (the signed learning rate,
$+\varepsilon_C$ for $y = C$ and $-\varepsilon_D$ for $y = D$) only depends on Y's
behaviour in the current round, the reward $r_{xy}$ depends on both players' behaviour
in the current round, and the Kronecker delta $\delta_{x'k}$ depends on X's behaviour in
the previous round. The behaviours of X and Y in the current round are called $x$ and
$y$, respectively, and in the previous round $x'$ and $y'$. $F_k$ only depends on $x$, $y$
and $x'$. To calculate this expectation value we need the joint probability
$\pi(x, y, x')$:

$$F_k = \sum_{x, y, x' \in \{C, D\}} \pi(x, y, x')\, \alpha_y\, r_{xy}\, \delta_{x'k}. \qquad (2)$$

To calculate $\pi(x, y, x')$ we sum over the joint probabilities which include the full
information of the last and current round:

$$\pi(x, y, x') = \sum_{y' \in \{C, D\}} \pi(x, y, x', y') = \sum_{y' \in \{C, D\}} \pi(x, y \mid x', y')\, s_{x'y'}. \qquad (3)$$

$s_{x'y'}$ is the probability of the occurrence of the state $(x', y')$. A repeated
prisoner's dilemma can be seen as a Markov process. The only problem is that Markov
processes assume constant transition probabilities, which is not the case in our model
because of the learning rule. We now assume that this process is quasistatic, which
allows us to use the steady-state probabilities of the Markov process. These state
probabilities can easily be calculated by solving the following eigenvalue equation:

$$\begin{pmatrix}
P_{CC}\, p_C & P_{CD}\, p_C & P_{DC}\, p_D & P_{DD}\, p_D \\
P_{CC}(1 - p_C) & P_{CD}(1 - p_C) & P_{DC}(1 - p_D) & P_{DD}(1 - p_D) \\
(1 - P_{CC})\, p_C & (1 - P_{CD})\, p_C & (1 - P_{DC})\, p_D & (1 - P_{DD})\, p_D \\
(1 - P_{CC})(1 - p_C) & (1 - P_{CD})(1 - p_C) & (1 - P_{DC})(1 - p_D) & (1 - P_{DD})(1 - p_D)
\end{pmatrix}
\begin{pmatrix} s_{CC} \\ s_{CD} \\ s_{DC} \\ s_{DD} \end{pmatrix}
=
\begin{pmatrix} s_{CC} \\ s_{CD} \\ s_{DC} \\ s_{DD} \end{pmatrix} \qquad (4)$$

where $P_{CD}$ is the cooperation probability of the ZD strategy if the ZD strategy has
cooperated and the human opponent has defected in the previous round. Quasistatic
processes in general are processes which have an infinitely long break after each
parameter change in which the system can equilibrate, so that there is a permanent
equilibrium. In our case of a repeated prisoner's dilemma this means that after every
round with a parameter change we assume an infinite number of rounds with no
parameter change, so that the system can equilibrate.

The probability $\pi(x, y \mid x', y')$ is the probability that in the current round the
state $(x, y)$ occurs if in the previous round the state $(x', y')$ occurred. This
probability can be calculated with the known cooperation probabilities of X and Y:

$$\pi(C, C \mid x', y') = P_{x'y'}\, p_{x'} \qquad (5)$$

$$\pi(D, C \mid x', y') = (1 - P_{x'y'})\, p_{x'} \qquad (6)$$

$$\pi(C, D \mid x', y') = P_{x'y'}\, (1 - p_{x'}) \qquad (7)$$

$$\pi(D, D \mid x', y') = (1 - P_{x'y'})(1 - p_{x'}) \qquad (8)$$

Prediction of Final Cooperativity 

For predicting final cooperation rates $c_e$ from initial cooperation rates $c_s$ we
calculate numerically the probability $P(c_s, c_e)$ of the model that a randomly chosen
agent has the cooperativity $(c_s, c_e)$. With this probability we can calculate the
conditional probability

$$P(c_e \mid c_s) = \frac{P(c_s, c_e)}{\int_0^1 P(c_s, c_e)\, dc_e} \qquad (9)$$

and the conditional expectation value

$$E(c_e \mid c_s) = \int_0^1 P(c_e \mid c_s)\, c_e\, dc_e. \qquad (10)$$

Plotting the conditional expectation value of the predicted cooperation rates in the
last ten rounds versus the experimental data, $(E(c_e \mid c_s), c_e)$, should lead to a
distribution close to the angle bisector if the model is correct.
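When cooperation rates are measured over ten rounds they take only the values 0, 0.1, ..., 1, so the integrals in equations (9) and (10) reduce to sums. A minimal sketch, assuming that the joint probability is estimated from a list of simulated pairs of initial and final cooperation rates; the function name `conditional_expectation` and the input format are hypothetical:

```python
from collections import Counter, defaultdict


def conditional_expectation(pairs):
    """Map each initial rate c_s to the expected final rate E(c_e | c_s).

    pairs -- iterable of (c_s, c_e) tuples, one per simulated agent
    """
    joint = Counter(pairs)
    total = sum(joint.values())
    # Marginal probability of c_s (denominator of equation (9)).
    marginal = defaultdict(float)
    for (c_s, _), count in joint.items():
        marginal[c_s] += count / total
    # Conditional expectation (equation (10)) as a weighted sum over c_e.
    expectation = defaultdict(float)
    for (c_s, c_e), count in joint.items():
        conditional = (count / total) / marginal[c_s]
        expectation[c_s] += conditional * c_e
    return dict(expectation)


example = conditional_expectation([(0.2, 0.0), (0.2, 0.4), (0.8, 1.0)])
```

For each subject the observed initial rate is then looked up in the returned dictionary to obtain the model prediction that is compared with the observed final rate.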


Measurement of the Flow 


To check whether the learning rule is correct we want to estimate the flow from
experimental data. To this end we divide the sequence into several parts (which can
overlap) and calculate the mean values of $p_C$ and $p_D$ in each part:

$$p_C = \frac{n_{CC'}}{n_{CC'} + n_{DC'}} \qquad (11)$$

$$p_D = \frac{n_{CD'}}{n_{CD'} + n_{DD'}} \qquad (12)$$

Here, e.g., $n_{CD'}$ is the number of times the human subject cooperates in the current
round if the computer has defected in the previous round. If the sequences are long
enough we can assume that the differences between the mean values are caused by the
change of the cooperation probabilities and not by noise. If an agent only cooperates or
only defects, it is only possible to estimate one of the two probabilities. In this case we
assume that the other cooperation probability is unchanged and use the probability of
the previous part. If this occurs at the beginning of the game we take the probability of
the next part.
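The counting behind equations (11) and (12) can be sketched as follows. The sketch assumes that the moves of one part are given as two equally long strings of 'C' and 'D'; it returns `None` for a probability that cannot be estimated and leaves the substitution by the neighbouring part to the caller. The function names `estimate_probabilities` and `parts` are hypothetical.

```python
def estimate_probabilities(human, computer):
    """Estimate (p_C, p_D) from one part of a game.

    human, computer -- sequences of 'C'/'D' of equal length; the human's move
    in round t is paired with the computer's move in round t - 1.
    """
    counts = {(y, x_prev): 0 for y in "CD" for x_prev in "CD"}
    for t in range(1, len(human)):
        counts[(human[t], computer[t - 1])] += 1

    def ratio(numerator, denominator):
        # None marks a probability that cannot be estimated from this part.
        return numerator / denominator if denominator else None

    p_c = ratio(counts[("C", "C")], counts[("C", "C")] + counts[("D", "C")])
    p_d = ratio(counts[("C", "D")], counts[("C", "D")] + counts[("D", "D")])
    return p_c, p_d


def parts(n_rounds, part_len, step):
    """Start and end indices (end exclusive) of possibly overlapping parts."""
    return [(start, start + part_len)
            for start in range(0, n_rounds - part_len + 1, step)]


p_c, p_d = estimate_probabilities("CCDCD", "CDDCC")
```

The difference between the estimates of consecutive parts, divided by the number of rounds between them, then serves as a crude estimate of the flow at the corresponding point in choice probability space.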

We tested this very simple method with computer-generated data of 500 agents in a game
of 200 rounds. This is a large but realistic number of rounds for an experiment. If the
human subjects have, e.g., 15 seconds for each decision, which is quite a lot for pressing
one of two buttons, they would need 50 minutes. An experiment with 500 human subjects
is a lot of work, but every subject plays independently against a computer. The humans
do not have to play at the same time, so the amount of work increases only linearly with
the number of human subjects.

The results of this measurement are shown in figure 6. To determine the quality of this
method we calculate the coherence of the analytic flow $F_a$ and the measured flow $F_n$:

$$\frac{(F_a \cdot F_n)^2}{(F_a \cdot F_a)\,(F_n \cdot F_n)}$$

The largest coherences were found at a part length of 110 rounds and an overlap between
these parts of 90% (so part one goes from round 1 to 110, part two from round 12 to
121, etc.). The coherences for the examples shown are 0.40 and 0.53, which are typical
for the strategies. Especially at the border the analytic and measured flows do not match
well. This is caused by the boundary condition, which is used in the computer-generated
data but not in the calculation of the flow.
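The coherence between two flow fields can be computed in a few lines. The sketch below assumes that each flow is stored as a list of two-component vectors on a common grid, that the dot product sums over all grid points and both components, and that the numerator is squared, as is usual for a coherence measure. The function name `coherence` is illustrative.

```python
def coherence(flow_a, flow_n):
    """Normalised squared overlap of two vector fields on the same grid.

    flow_a, flow_n -- lists of (dx, dy) tuples, one per grid point.
    Returns 1.0 for parallel fields and 0.0 for orthogonal ones.
    """
    def dot(u, v):
        return sum(ux * vx + uy * vy for (ux, uy), (vx, vy) in zip(u, v))

    norm = dot(flow_a, flow_a) * dot(flow_n, flow_n)
    if norm == 0.0:
        raise ValueError("coherence is undefined for a vanishing flow field")
    return dot(flow_a, flow_n) ** 2 / norm


analytic = [(1.0, 0.0), (0.0, 2.0)]
measured = [(2.0, 0.0), (0.0, 4.0)]
value = coherence(analytic, measured)
```

With this definition the measure is insensitive to an overall rescaling of either field, so it compares only the directions and relative magnitudes of the arrows.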

The coherences in the extortion cases are significantly higher than in the generous
cases. This is caused by the left part of the flow ($p_D = 0 \ldots 0.5$), where nearly all
arrows are parallel, so errors in the estimated cooperation probability do not lead to a
lower coherence.